Local and global topics in text modeling of web pages nested in web sites
Authors
Abstract
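Topic models assert that documents are distributions over latent topics and latent topics are distributions over words. A nested document collection has documents nested inside a higher order structure such as articles nested in journals, podcasts within authors, or web pages nested in web sites. In a single collection of documents, topics are global, or shared across all documents. For web pages nested in web sites, topic frequencies likely vary from web site to web site and, within a web site, almost certainly vary from web page to web page. A hierarchical prior for topic frequencies models this structure with a global topic distribution, web site topic distributions varying around the global distribution, and web page topic distributions varying around the web site distribution. Web pages in one United States local health department web site often contain local geographic and news topics not found on the web pages of other local health department web sites, so some topics are likely unique to an individual web site. Regular topic models ignore the nesting and may identify local topics, but cannot label those topics as local nor identify the corresponding web site owner. Explicitly modeling local topics identifies the owning web site and identifies the topic as local. In US health web site data, topic coverage is defined at the web site level after removing local topic words from web pages. Hierarchical local topic models can be used to study how well health topics are covered.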
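The three-level prior described in the abstract (global, then web site, then web page topic frequencies) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the use of Dirichlet distributions at each level, the concentration parameters `site_conc` and `page_conc`, and the function names. Local, site-specific topics are left out.

```python
import random


def sample_dirichlet(mean, concentration, rng):
    """Draw from Dirichlet(concentration * mean) using normalized Gamma draws."""
    # A tiny floor keeps every Gamma shape strictly positive (assumed safeguard).
    draws = [rng.gammavariate(max(concentration * m, 1e-8), 1.0) for m in mean]
    total = sum(draws)
    return [d / total for d in draws]


def sample_nested_topic_frequencies(n_topics, pages_per_site,
                                    site_conc=20.0, page_conc=5.0, seed=0):
    """Sample global, per-site, and per-page topic frequencies.

    pages_per_site: list with the number of pages in each web site.
    site_conc, page_conc: hypothetical concentrations; larger values keep
    a child distribution closer to its parent distribution.
    """
    rng = random.Random(seed)
    uniform = [1.0 / n_topics] * n_topics
    # Global topic distribution: a flat Dirichlet(1, ..., 1) draw.
    global_freq = sample_dirichlet(uniform, float(n_topics), rng)
    sites = []
    for n_pages in pages_per_site:
        # Each site's frequencies vary around the global distribution.
        site_freq = sample_dirichlet(global_freq, site_conc, rng)
        # Each page's frequencies vary around its own site's distribution.
        page_freqs = [sample_dirichlet(site_freq, page_conc, rng)
                      for _ in range(n_pages)]
        sites.append((site_freq, page_freqs))
    return global_freq, sites


if __name__ == "__main__":
    global_freq, sites = sample_nested_topic_frequencies(4, [2, 3])
    print("global:", [round(x, 3) for x in global_freq])
    for i, (site_freq, page_freqs) in enumerate(sites):
        print("site", i, [round(x, 3) for x in site_freq], "pages:", len(page_freqs))
```

In this sketch a smaller `page_conc` than `site_conc` makes pages vary more around their site than sites vary around the global distribution, which mirrors the abstract's remark that page-to-page variation is almost certain.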
Similar resources
Analyzing new features of infected web content in detection of malicious web pages
Recent improvements in web standards and technologies enable the attackers to hide and obfuscate infectious codes with new methods and thus escaping the security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...
Adaptive Web Sites: Automatically Synthesizing Web Pages
The creation of a complex web site is a thorny problem in user interface design. In IJCAI ’97, we challenged the AI community to address this problem by creating adaptive web sites: sites that automatically improve their organization and presentation by mining visitor access data collected in Web server logs. In this paper we introduce our own approach to this broad challenge. Specifically, we ...
Identifying Corporate Managerial Topics with Web Pages
This paper has as its main aim to analyse how corporate web pages can become an essential tool in order to detect strategic trends by firms or sectors, and even a primary source for benchmarking. This technique has made it possible to identify the key issues in the strategic management of the most excellent large Spanish firms and also to describe trends in their long-range planning, a way of w...
Local Aspects of the Global Ranking of Web Pages
Started in 1998, the search engine Google estimates page importance using several parameters. PageRank is one of those. Precisely, PageRank is a distribution of probability on the Web pages that depends on the Web graph. Our purpose is to show that the PageRank can be decomposed into two terms, internal and external PageRank. These two PageRanks allow a better comprehension of the PageRank sign...
Text Categorization of Commercial Web Pages
In this paper we describe a new on-line document categorization strategy that can be integrated within Web applications. A salient aspect is the use of neural learning in both representation and classification tasks. Within text documents conceived as images, the regions of interest (RoI) containing information meaningful for categorization are identified with the support of a supervised neural...
Journal
Journal title: Computational Statistics & Data Analysis
Year: 2022
ISSN: 0167-9473, 1872-7352
DOI: https://doi.org/10.1016/j.csda.2022.107518